Course: Applied Geo-data Science at University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Adviser: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
If you have questions about the workflow, please contact the author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 3 (Chapter 8)
Modelling is an important aspect of data science, but there are different types of models of varying complexity. Using a realistic problem, the following skills are practised:
Understand the basics of regression and classification models.
Fit linear and logistic regression models in R.
Choose and calculate relevant model performance metrics.
Evaluate and compare regression models.
Detect data outliers.
Select the best predictive variables.
In environmental and climate science, dozens of variables are measured. We want to predict a target with a linear model, but unfortunately we do not know in advance how many variables, and which ones, we should use. We only know that we are looking for a formula of the form:
\[ \hat{y}=\hat{a}_0 +\sum_{n=1}^{N}\hat{a}_n x_n \]
To build the best linear regression model we could, of course, simply try all combinations by permutation, but that would be a waste of resources. We cannot avoid a trial-and-error approach entirely, but we would like to reduce the number of combinations. We therefore implement an algorithm that works as follows:
Let \(p\in[1,N]\) denote a candidate predictor and \(N\in[1,max(variable)]\) the number of variables considered. \(N\) can cover all available variables or only a subset; our algorithm decides this.
The algorithm takes each candidate p and creates a linear model. The p with the highest RSQ is kept as a predictor.
The algorithm then takes p+1 predictors and creates a linear model for each remaining candidate. The model with the highest RSQ is selected and its AIC is calculated.
Step 3 is iterated until the AIC of the model no longer decreases.
We have found the (presumably) optimal model.
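Assuming base R's lm() and AIC() are used, the loop described above can be sketched as follows. The data frame and variable names here are invented for illustration; the actual course functions are sourced later and may differ in detail.

```r
# Forward selection sketch: pick the candidate with the highest RSQ
# in each round, stop as soon as the AIC no longer decreases.
set.seed(42)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(n)   # x3 is pure noise

candidates <- c("x1", "x2", "x3")
selected   <- character(0)
best_aic   <- Inf

repeat {
  # Try each remaining candidate added to the current predictor set
  rsq <- sapply(candidates, function(p) {
    f <- reformulate(c(selected, p), response = "y")
    summary(lm(f, data = df))$r.squared
  })
  winner <- candidates[which.max(rsq)]
  fit    <- lm(reformulate(c(selected, winner), response = "y"), data = df)
  # Stopping rule: accept the new predictor only if the AIC decreases
  if (AIC(fit) >= best_aic) break
  best_aic   <- AIC(fit)
  selected   <- c(selected, winner)
  candidates <- setdiff(candidates, winner)
  if (length(candidates) == 0) break
}
selected   # predictors retained by the forward search
```

With this toy data, x1 (the strongest signal) is picked first, x2 follows, and the noise variable x3 typically fails the AIC test.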
For the first task we apply only the first 2 steps, because we want only a bivariate model. For the second task we iterate step 3; in this way we can also choose the best multivariate model.
All analyses are done with the open-source software RStudio (version 2022.12.0+353), which uses the language R to handle the data. For efficient data processing we use the R package “tidyverse” (and others).
The following code chunk loads all packages we need. Particularly important is the package “conflicted”: it lets us choose which function to use when functions from different packages share the same name but do not do the same thing.
source("../../R/general/packages.R")
First, we download the data and save it in our repository. We must do this only once: if the file already exists, we can read it directly. That is why we implement an if-else statement.
# We want to know whether a certain file already exists
name.of.file <- "../../data/re_stepwise/FLX_CH_stepwise.csv"
# If no such file exists, create it!
if (!file.exists(name.of.file)){
# Access the data
url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/df_for_stepwise_regression.csv"
# Read in the data directly from the URL
database.S1 <- read.table(url, header = TRUE, sep = ",")
# Write a CSV file into the repository
write_csv(database.S1,"../../data/re_stepwise/FLX_CH_stepwise.csv")
# Read the file
database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")
# If such a file already exists, just read it
}else{database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")}
To make a good decision about the predictors we need as much information as possible. For that we read our file into a data frame. After that, we want an overview of the data (which variables does the data set contain? What are the main statistical parameters?)
# Check whether there are any missing values (we look at the first six rows only)
head(apply(database,2, function(x) is.na(x)))
## siteid TIMESTAMP TA_F SW_IN_F LW_IN_F VPD_F PA_F P_F WS_F TA_F_MDS
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## SW_IN_F_MDS LW_IN_F_MDS VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF USTAR
## [1,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [2,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [5,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
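As a side note, the per-column number of missing values can also be obtained in one call with colSums() over is.na(). A minimal sketch on a toy data frame (not the FLUXNET data):

```r
# Count missing values per column in a single call
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 1:3)
na_per_column <- colSums(is.na(df))
na_per_column   # a: 1, b: 2, c: 0
```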
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call to visualize the NA values
visualize.na.values.without.groups(database)
# Get an overview and the main statistical parameters
summary(database)
## siteid TIMESTAMP TA_F SW_IN_F
## Length:23011 Min. :1996-01-01 Min. :-29.184 Min. : 0.156
## Class :character 1st Qu.:2002-12-01 1st Qu.: 0.431 1st Qu.: 52.933
## Mode :character Median :2007-02-15 Median : 7.286 Median :120.679
## Mean :2006-10-25 Mean : 6.950 Mean :137.633
## 3rd Qu.:2011-01-23 3rd Qu.: 13.313 3rd Qu.:215.243
## Max. :2014-12-31 Max. : 31.166 Max. :398.192
##
## LW_IN_F VPD_F PA_F P_F
## Min. :138.1 Min. : 0.000 Min. : 80.37 Min. : 0.000
## 1st Qu.:265.0 1st Qu.: 0.831 1st Qu.: 84.31 1st Qu.: 0.000
## Median :300.1 Median : 2.542 Median : 97.38 Median : 0.010
## Mean :295.8 Mean : 3.787 Mean : 93.47 Mean : 2.320
## 3rd Qu.:328.9 3rd Qu.: 5.239 3rd Qu.: 98.77 3rd Qu.: 1.749
## Max. :420.1 Max. :33.290 Max. :103.28 Max. :186.400
##
## WS_F TA_F_MDS SW_IN_F_MDS LW_IN_F_MDS
## Min. :0.106 Min. :-29.184 Min. : 0.156 Min. :138.1
## 1st Qu.:1.805 1st Qu.: 0.446 1st Qu.: 53.317 1st Qu.:269.5
## Median :2.387 Median : 7.356 Median :120.031 Median :303.3
## Mean :2.670 Mean : 6.987 Mean :137.482 Mean :299.4
## 3rd Qu.:3.286 3rd Qu.: 13.301 3rd Qu.:215.580 3rd Qu.:332.9
## Max. :9.928 Max. : 31.166 Max. :398.192 Max. :420.1
## NA's :325 NA's :341 NA's :11057
## VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF
## Min. : 0.000 Min. :297.3 Min. : -0.249 Min. :-6.683
## 1st Qu.: 0.852 1st Qu.:372.1 1st Qu.: 91.040 1st Qu.: 0.802
## Median : 2.612 Median :383.7 Median :231.956 Median : 2.845
## Mean : 3.850 Mean :385.1 Mean :271.714 Mean : 3.503
## 3rd Qu.: 5.326 3rd Qu.:395.6 3rd Qu.:430.028 3rd Qu.: 5.600
## Max. :33.290 Max. :724.2 Max. :995.106 Max. :18.511
## NA's :768 NA's :290 NA's :4574 NA's :2426
## USTAR
## Min. :0.0749
## 1st Qu.:0.2850
## Median :0.3856
## Mean :0.4255
## 3rd Qu.:0.5244
## Max. :1.5941
## NA's :2468
We choose the best variable for our predictor. For that, we try all variables and select the one with the highest RSQ (see introduction).
# Access the outsourced function for the bivariate model
source("../../R/re_stepwise/function.bivariate.model.R")
# Function call
tibble.bi.model <- model.fitter(database, 16)
knitr::kable(tibble.bi.model)
| Target | Predictor | RR | AIC | Fit |
|---|---|---|---|---|
| GPP_NT_VUT_REF | siteid | 0.0458181 | 46735.32 | NO |
| GPP_NT_VUT_REF | TIMESTAMP | 0.0047922 | 47597.89 | NO |
| GPP_NT_VUT_REF | TA_F | 0.3871016 | 37619.26 | NO |
| GPP_NT_VUT_REF | SW_IN_F | 0.4306405 | 36102.41 | NO |
| GPP_NT_VUT_REF | LW_IN_F | 0.1676259 | 43919.98 | NO |
| GPP_NT_VUT_REF | VPD_F | 0.1908666 | 43337.05 | NO |
| GPP_NT_VUT_REF | PA_F | 0.0002403 | 47691.83 | NO |
| GPP_NT_VUT_REF | P_F | 0.0020894 | 47653.72 | NO |
| GPP_NT_VUT_REF | WS_F | 0.0082500 | 47526.24 | NO |
| GPP_NT_VUT_REF | TA_F_MDS | 0.3871137 | 37559.45 | NO |
| GPP_NT_VUT_REF | SW_IN_F_MDS | 0.4300828 | 36065.75 | NO |
| GPP_NT_VUT_REF | LW_IN_F_MDS | 0.1209379 | 25130.95 | NO |
| GPP_NT_VUT_REF | VPD_F_MDS | 0.1811937 | 42567.59 | NO |
| GPP_NT_VUT_REF | CO2_F_MDS | 0.0295658 | 47078.98 | NO |
| GPP_NT_VUT_REF | PPFD_IN | 0.4520702 | 29689.54 | BEST FIT |
| GPP_NT_VUT_REF | GPP_NT_VUT_REF | NA | NA | NA |
| GPP_NT_VUT_REF | USTAR | 0.0000316 | 45048.57 | NO |
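How one row of the table above can be produced: fit a bivariate model and extract RSQ and AIC. The toy data below is invented; model.fitter() itself is defined in the sourced script and may differ in detail.

```r
# Fit one bivariate linear model and pull out its RSQ and AIC
set.seed(1)
toy <- data.frame(x = runif(50, 0, 100))
toy$y <- 0.05 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)
rsq <- summary(fit)$r.squared  # coefficient of determination
aic <- AIC(fit)                # Akaike information criterion
c(RSQ = rsq, AIC = aic)
```

Repeating this for every candidate predictor and keeping the highest RSQ yields the “BEST FIT” row of the table.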
For bivariate lm models a scatter plot is meaningful and gives us a first impression. But we want to know more about the coefficients, the intercept, the distribution, and outliers. For that we use summary() and plot the model (see discussion).
# Calculate the probably best bivariate model
best.bi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN, data = database)
# Load the function
source("../../R/re_stepwise/function.vis.bi.model.R")
# Plot the bivariate model
vis.bi.model(database)
# Overview of all important model parameters
# Model call; we are also interested in the p-values
summary(best.bi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN, data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0475 -1.3255 -0.3758 1.2780 12.3295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.72533 0.03057 23.73 <2e-16 ***
## PPFD_IN 0.01062 0.00009 117.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 16872 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.4521, Adjusted R-squared: 0.452
## F-statistic: 1.392e+04 on 1 and 16872 DF, p-value: < 2.2e-16
# Plot the model
plot(best.bi.lm)
What happens with RSQ? Here we visualize the RSQ for each variable. We can easily see that PPFD_IN has the highest RSQ, which is why we chose it as our predictor. But what if we want a multivariate regression? One might think that we should simply choose the variable with the second-highest RSQ, but we will see that this is generally not the case.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.bi.R")
# Function call
Vis_of_the_develop.bi.model(tibble.bi.model)
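A minimal sketch of such a per-variable comparison with base R's barplot(). The RSQ values below are rounded from the table above; the real figure is produced by the sourced Vis_of_the_develop.bi.model().

```r
# One RSQ per candidate predictor, sorted and plotted
rsq_by_var <- c(PPFD_IN = 0.452, SW_IN_F = 0.431, TA_F = 0.387,
                VPD_F = 0.191, LW_IN_F = 0.168, PA_F = 0.0002)
rsq_sorted <- sort(rsq_by_var, decreasing = TRUE)
barplot(rsq_sorted, las = 2, ylab = expression(R^2),
        main = "RSQ of each bivariate model")
```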
Here we want to answer the question of which predictors we should choose to get the probably best linear regression model. For that we have written a function that works as described in the introduction. We make two function calls. For the first we use all variables except the target. This is, however, not necessarily the best way: we could argue that the time factor and the location are irrelevant and drop these variables. We can see that the difference is huge.
# Access the outsourced function of the multivariate model
source("../../R/re_stepwise/function.multivariate.model.R")
# Function call with all columns except the target (column number 16)
multi.model.1 <- multivariate.model(database, c(16))
## [1] "Best model fit has AIC = 21762.1756977506"
## [1] "PPFD_IN" "LW_IN_F" "siteid" "VPD_F" "TA_F_MDS" "SW_IN_F"
## [7] "WS_F" "CO2_F_MDS" "P_F" "PA_F"
# Function call without the variables siteid, TIMESTAMP and GPP_NT_VUT_REF
multi.model.2 <- multivariate.model(database, c(1,2,16))
## [1] "Best model fit has AIC = 24469.2854126171"
## [1] "PPFD_IN" "LW_IN_F" "VPD_F" "TA_F_MDS" "SW_IN_F" "P_F"
## [7] "WS_F" "CO2_F_MDS" "PA_F"
# Overview about the models
knitr::kable(multi.model.1)
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| siteid | 0.5981762 | 24464.35 |
| VPD_F | 0.6160337 | 23699.27 |
| TA_F_MDS | 0.6454925 | 22354.30 |
| SW_IN_F | 0.6513971 | 22072.88 |
| WS_F | 0.6564185 | 21830.06 |
| CO2_F_MDS | 0.6571345 | 21796.85 |
| P_F | 0.6577174 | 21770.14 |
| PA_F | 0.6579195 | 21762.18 |
| TIMESTAMP | 0.6579273 | 21763.80 |
knitr::kable(multi.model.2)
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| VPD_F | 0.5746412 | 25420.80 |
| TA_F_MDS | 0.5881248 | 24879.25 |
| SW_IN_F | 0.5918328 | 24728.65 |
| P_F | 0.5945354 | 24618.55 |
| WS_F | 0.5963983 | 24542.84 |
| CO2_F_MDS | 0.5981106 | 24473.10 |
| PA_F | 0.5982491 | 24469.29 |
| TA_F | 0.5982635 | 24470.68 |
We visualize the development of the model. We can see that the model with more variables has a higher RSQ. That is no surprise, because RSQ always increases when we add a variable. What matters is the AIC. We can see that the AIC is lower for the model with more variables (21762.2 vs. 24469.3) and the RSQ is higher (0.6579 vs. 0.5983). It follows that model 1 is better than model 2. We also see this in the coefficients.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.multi.R")
# Function call
vis.and.dev.multi.model(multi.model.1)
vis.and.dev.multi.model(multi.model.2, c("[without siteid and TIMESTAMP]"))
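The AIC comparison in the text can be reproduced for any pair of nested models with AIC() and an F-test via anova(). The toy data below is invented; the factor g imitates the site dummies of model 1.

```r
# Compare a smaller and a larger nested model by AIC and F-test
set.seed(7)
n <- 300
d <- data.frame(x = rnorm(n),
                g = factor(sample(c("A", "B", "C"), n, replace = TRUE)))
d$y <- 1.5 * d$x + ifelse(d$g == "A", 1, 0) + rnorm(n)

m_small <- lm(y ~ x,     data = d)  # analogous to model 2
m_large <- lm(y ~ x + g, data = d)  # analogous to model 1
AIC(m_small, m_large)               # lower AIC -> preferred model
anova(m_small, m_large)             # F-test for the added terms
```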
For multidimensional models there is no meaningful scatter plot. But we want to know more about the coefficients, the intercept, the distribution, and outliers, so we make some calls to summary(). We get the same values for RSQ, which confirms that our algorithm works. We see that not all coefficients are significant.
# Calculate the probably best multivariate model
best.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
data = database)
second.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F,
data = database)
# Overview of all important model parameters
# Model call
summary(best.multi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
## SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
## data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3803 -1.1726 -0.0431 1.1877 11.4929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.833e+00 1.609e+00 -3.004 0.00267 **
## PPFD_IN 3.098e-03 4.833e-04 6.411 1.49e-10 ***
## LW_IN_F 7.823e-03 6.821e-04 11.469 < 2e-16 ***
## VPD_F -3.238e-01 7.285e-03 -44.455 < 2e-16 ***
## TA_F_MDS 1.659e-01 4.972e-03 33.361 < 2e-16 ***
## SW_IN_F 1.680e-02 9.998e-04 16.808 < 2e-16 ***
## P_F -1.290e-02 2.495e-03 -5.172 2.35e-07 ***
## WS_F -1.900e-01 1.355e-02 -14.020 < 2e-16 ***
## CO2_F_MDS -3.826e-03 8.153e-04 -4.693 2.71e-06 ***
## PA_F 5.735e-02 1.827e-02 3.139 0.00170 **
## siteidCH-Lae 8.313e-01 1.889e-01 4.400 1.09e-05 ***
## siteidFI-Hyy -5.900e-02 2.974e-01 -0.198 0.84274
## siteidFR-Pue -1.834e+00 2.782e-01 -6.590 4.52e-11 ***
## TIMESTAMP -6.450e-06 1.047e-05 -0.616 0.53783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.905 on 16860 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.6579, Adjusted R-squared: 0.6577
## F-statistic: 2494 on 13 and 16860 DF, p-value: < 2.2e-16
summary(second.multi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
## SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F, data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7477 -1.2228 -0.1153 1.1085 12.3241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5891656 0.4463032 -3.561 0.000371 ***
## PPFD_IN 0.0067222 0.0005163 13.021 < 2e-16 ***
## LW_IN_F 0.0152432 0.0006854 22.240 < 2e-16 ***
## VPD_F -0.3885851 0.0077764 -49.970 < 2e-16 ***
## TA_F_MDS 0.1035322 0.0049296 21.002 < 2e-16 ***
## SW_IN_F 0.0127810 0.0010750 11.889 < 2e-16 ***
## P_F -0.0233915 0.0026829 -8.719 < 2e-16 ***
## WS_F -0.1374643 0.0138279 -9.941 < 2e-16 ***
## CO2_F_MDS -0.0064341 0.0007675 -8.383 < 2e-16 ***
## PA_F 0.0077160 0.0032004 2.411 0.015922 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.064 on 16864 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.5982, Adjusted R-squared: 0.598
## F-statistic: 2790 on 9 and 16864 DF, p-value: < 2.2e-16
# Plot the model
plot(best.multi.lm)
It seems that the forward stepwise algorithm was successfully implemented. However, it is important to choose the predictors very carefully, because the final model can vary. The target, of course, must be removed before we run the algorithm: if we use the target as a predictor, RSQ will always be 1. Additionally, we must be aware of linear combinations. We must always know what each variable stands for, because linear combinations can distort our results. We must then decide whether a variable is a meaningful predictor. For example, a time factor could be very important: perhaps the experiment took place outdoors, where climate factors matter. It could also be that the experiment took place indoors under controlled laboratory conditions, in which case date and time are irrelevant. It is not always clear what a “good choice” is.
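Exact linear combinations among predictors can be detected before (or after) fitting, for example with alias() on a full model. A small sketch with invented data, where x3 is deliberately built as x1 + x2:

```r
# Detect an exact linear combination among predictors
set.seed(3)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$x3 <- d$x1 + d$x2          # redundant: exact linear combination
d$y  <- d$x1 - d$x2 + rnorm(30)

fit <- lm(y ~ x1 + x2 + x3, data = d)
alias(fit)$Complete          # reports x3 = x1 + x2
is.na(coef(fit)["x3"])       # TRUE: lm() dropped the redundant column
```

lm() silently drops such a column (NA coefficient), but in a stepwise search redundant variables still waste iterations and blur the interpretation, so it pays to screen them out first.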
For task 1 we could have written a very similar function as for task 2, but we wrote one that is a little more complicated than necessary. We did this to demonstrate how the algorithm works. The algorithm calculates a linear regression model for each predictor. The table shows that the model with the highest RSQ is not necessarily the model with the lowest AIC.
For task 2 we also wrote a function: a forward stepwise algorithm. First, the function calculates a bivariate linear regression model for each candidate and chooses the one with the highest RSQ. After that, a multivariate regression model is calculated and again the candidate with the highest RSQ is chosen. In each round, we compare the AIC. If the AIC is higher than in the round before, we stop the calculation and have found our probably best-fitting model. But here we have the same problem as described above: our result depends on our input. Therefore, we need to consider carefully which variables we include in the calculation.
For demonstration, we made several function calls. We can easily see that the results differ. That means we must (1) choose wisely (perhaps with an educated guess), (2) try different calls to be able to estimate how big the difference is, and (3) document how and why we made each decision.
Inspired by: https://www.statology.org and the book “Statistik” by L. Fahrmeir
Residuals vs. fitted plot
We can see that the red line follows the dashed line almost exactly. It is near zero and approximately horizontal. That means linearity is not violated and the assumption behind using a linear regression model was justified.
Q-Q-Plot
The Q-Q plot shows that the residuals almost follow a normal distribution. The error is probably mainly statistical and not systematic. The plot also supports the choice of a linear model, because the points follow the 45° line quite closely (if they lay exactly on it, the residuals would be perfectly normally distributed).
Scale-Location Plot
The red line should be roughly horizontal across the plot. If it is, the assumption of homoscedasticity probably holds. The scatter should be random, with no pattern at all. Both are approximately true: the spread of the residuals is roughly equal at all fitted values.
Leverage-Plot
No values lie beyond Cook’s distance. That means there are no influential outliers that distort our analysis.
The plots are approximately the same as for the bivariate model and need no further discussion. Only the scale-location plot deserves a few words: the red line is not horizontal, so there could be a problem with homoscedasticity. It is not dramatic and does not require intervention, but we should keep it in mind.
We can see that RSQ increases fast in the first few steps, but then the improvement slows down and eventually stops, which is when we can stop extending our model. The AIC shows similar behaviour, except that it decreases instead of increases: first the decrease is fast, then it slows down. In the last step the AIC increased slightly; that is the point at which we stop the calculation.
RSQ is based on the variance decomposition theorem, which states that the variance of the dependent variable can be written as the sum of the variance explained by the regression model and the variance of the residuals (unexplained variance). By definition, it increases with additional variables, but the growth flattens out. Strictly, we should discuss the adjusted RSQ instead of RSQ. RSQ is always a bit higher than the adjusted RSQ, but it does not matter which one we use in the algorithm, because the algorithm stops when the AIC increases. When we choose a final model, however, we have to use the adjusted RSQ. AIC is a quality measure for models: it is based on the log-likelihood plus a correction term that accounts for the number of parameters. At the beginning the AIC becomes smaller; once the influence of the correction term becomes too large, the AIC stops decreasing, and we have a basis for deciding on the model choice.
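Both quantities can be checked by hand for an lm fit (invented toy data): the adjusted RSQ via its standard formula, and the AIC via the Gaussian log-likelihood with a penalty for the number of estimated parameters (coefficients plus the residual variance).

```r
# Verify adjusted RSQ and AIC against base R's built-in values
set.seed(5)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

n   <- nobs(fit)
k   <- length(coef(fit))       # coefficients incl. intercept
rss <- sum(resid(fit)^2)
r2  <- summary(fit)$r.squared

adj_r2  <- 1 - (1 - r2) * (n - 1) / (n - k)        # adjusted RSQ
log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic     <- -2 * log_lik + 2 * (k + 1)              # +1 for sigma^2

all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
all.equal(aic, AIC(fit))                       # TRUE
```

This makes the trade-off explicit: the -2·log-likelihood term rewards fit, while the 2·(k+1) term penalizes every added parameter.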
It is also important to note that RSQ values do not simply add up but must be recalculated at every step. This means that for a multivariate model we cannot simply take the variables with the highest RR of the individual bivariate models, or those with the highest correlation coefficients. This is visible in the figures “visualization of the development of RR and AIC (bivariate)”. That is why we have to use the AIC for a good model choice, and also why our forward stepwise algorithm recalculates the RR in each round.